Abstract

We propose an automated and unsupervised method- 
ology for a novel summarization of group behavior 
based on content preference. We show that graph the- 
oretical community evolution (based on similarity of 
user preference for content) is effective in indexing 
these dynamics. Combined with text analysis that tar- 
gets automatically-identified representative content for 
each community, our method produces a novel multi- 
layered representation of evolving group behavior. We 
demonstrate this methodology in the context of politi- 
cal discourse on a social news site with data that spans 
more than four years and find coexisting political leanings
over extended periods, as well as a disruptive external
event that led to a significant reorganization of existing
patterns. Finally, where there exists no ground truth,
we propose a new evaluation approach by using en- 
tropy measures as evidence of coherence along the evo- 
lution path of these groups. This methodology is valu- 
able to designers and managers of online forums in need 
of granular analytics of user activity, as well as to re- 
searchers in social and political sciences who wish to 
extend their inquiries to large-scale data available on the 
web. 



1 Introduction 

Online forums and social news sites have created new
spaces for user interaction that can influence millions of
individuals as well as traditional media platforms. The
importance of these spaces is evident in the surge of web-data
analysis throughout the 2012 US presidential election (Metaxas
and Mustafaraj 2012). These spaces of discussion and
information sharing provide large datasets that can be
investigated by scholars in various fields. While important
questions have been formed and extensively studied by social
scientists in smaller scales, any such attempt on the web is 
met with the computational challenges of processing and ab- 
stracting very large, complex, and often noisy data, render- 
ing methods developed for smaller scales impractical. 

While an array of techniques has been developed for
generating a variety of summary statistics from large-scale
data, they all fall short of offering a complete multi-layered
summary that enables scholarly investigation in social
sciences or allows the owner of a site to understand her user
base in terms of how they interact with content and with 
each other, and how such interaction patterns evolve over 
time. Consider a website with many users who share arti- 
cles online and express their opinions in various ways. Other 
than the unscalable approach of manually and laboriously 
following almost all the activities of the user population and 
becoming experts on the related topics, how would one be- 
gin to understand its dynamics? One can begin by reporting 
simple statistical measures (most popular articles, most ac- 
tive or most influential users, percentage of items contain- 
ing some keyword, increase or decrease in activity), employ 
language processing to measure positive or negative senti- 
ment, detect topics of discussion, or use regression to model 
or predict specific measures. As in the parable of the Blind Men
and the Elephant, these techniques provide us with disjoint,
specific pieces of information. We believe there is a need 
for development of automated tools that are not manually 
coded with domain-specific knowledge (hence, applicable
to sites across several verticals), and yet provide a top-down 
summary of user dynamics. Such an automated multi-scale 
summary then can facilitate a more granular exploration. To 
the best of our knowledge, such a framework has not been 
offered by the scientific community. 

We use explicit indicators of user preference for content 
as the basis for our methodology. Some examples of such in- 
dicators are the "Like" button in Facebook, an "up" vote in 
reddit, a "+1" in Google-plus, or a "digg" on Digg. We will 
call these indicators votes in the context of this paper and 
will use them as clear and simple signals that can be used to 
infer user orientation toward content. For example, intuition 
suggests that users who prefer and promote the same politi- 
cal articles will have similar political leanings, whereas ex- 
plicit friendships do not necessarily suggest similar political 
orientations. 

Based on votes cast by users in a bipartite network of 
users and articles, we detect communities of users with sim- 
ilar voting patterns and track these communities' temporal 
evolution. We then identify representative content for each 
community based on their votes, and perform more detailed 
analysis on text and source of these representative sets, teasing
out persistent themes. Once a summary of the evolving
groups has been formed, several other interesting questions 
can be formed: Do users form polarized and insular groups? 
Does one group dominate or drive out other groups? Is there 
movement between groups? How can we design online com- 
munities to foster cross-group understanding? How do ex- 
ternal events affect these dynamics? What are the evolving 
interest patterns and what is driving them? 

We apply this methodology to a social news site named
Balatarin (translated The Highest), which is a mainly
Persian-language website. This platform is suitable for our 
purpose because it played a significant role during an im- 
portant political event, the Iranian post-election uprising in 
2009, dubbed the Green Movement. Balatarin became a hub 
for disseminating information and a space for people to ex- 
change opinions, propose ideas and even organize to take 
action to protest in the real world. Some of the more well-known
US-based examples of similar social news sites are
Reddit, Slashdot, and Digg.

Our methodology: 1) produces a novel visualization of 
political dynamics throughout the 4-year duration of the 
data, 2) finds politics-based evolution paths in multi-issue 
contexts, and 3) extracts user preferences for text and source 
of content. We are able to observe the patterns at different 
granularities by producing summaries at multiple scales and 
at different times. We focus on four example paths and show 
that as much as 40% of users stayed in the same path af- 
ter one year, indicating an implicit yet enduring community 
of users with consistently similar preference for content. We 
also find highly specific and persistent themes within some 
paths, relating to issues (such as international relations) or 
political orientations (such as pro Green Movement). The 
visualization shows that an external event (the post-election
uprising) had a sudden effect on these dynamics, causing major
reorganization of communities. Finally, we evaluate the
coherence within each path by studying the entropy of pub- 
lication sources from representative content and find that re- 
currence of domains within detected paths doubles, triples, 
or quadruples compared with articles drawn at random. We 
find that the paths are not insular: there are merges between
them as well as content overlaps. No one group or
path becomes so dominant as to drive out others. However,
following the election crisis there is a shift in focus and paths
reorganize around the Green Movement. The authors find it
very appealing and instructive that such a detailed summary 
could be reconstructed by employing a completely unsuper- 
vised and automated set of tools that assumes no knowledge 
of the underlying events or the background of the users. 
Presented with such a summary, a decision maker or a
researcher can then dig deeper and fill in the relationships and
connections with the external events that the group was
responding to and participating in. The detailed results and
associated commentaries are discussed in Section 3.2 and
demonstrated in Figure 5.

In the next section we explain the steps of the methodology
and include our proposed evaluation method. Section 3
describes the implementation of the methodology on our
dataset and details its results. Section 5 presents an overview
of related work and Section 6 discusses further ideas and
concludes the paper.

2 Description of Methodology 

In this section we will describe the steps of the methodology: 
defining the network and implementation of community de- 
tection and evolution in successive times. We then produce 
content summaries of evolving communities and propose a 
method to evaluate the results. 

2.1 Community Evolution 

To group users who vote similarly, we define a bipartite
network of users and articles where each edge is a vote cast
by a user to an article. Figure 1 illustrates this structure.
We project this bipartite network onto a weighted unipartite
(single-mode) graph consisting of users only, where the
weight of an edge between two users reflects how similarly
they vote. The edge weight between a pair of users (x, y) is
computed using the Jaccard Index:

$$w_{xy} = \frac{n(X \cap Y)}{n(X \cup Y)}$$

where X and Y are the sets of articles voted for by users x
and y respectively, and n stands for set cardinality.
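The projection step can be illustrated with a minimal sketch in plain Python. The function name `jaccard_projection`, the `(user, article)` input format, and the toy votes are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
from itertools import combinations

def jaccard_projection(votes):
    """Project a list of (user, article) votes onto a weighted user-user graph.

    Returns {(x, y): weight} for user pairs sharing at least one article,
    where weight = n(X & Y) / n(X | Y) (the Jaccard Index).
    """
    articles_by_user = {}
    for user, article in votes:
        articles_by_user.setdefault(user, set()).add(article)

    weights = {}
    for x, y in combinations(sorted(articles_by_user), 2):
        X, Y = articles_by_user[x], articles_by_user[y]
        shared = len(X & Y)
        if shared:  # pairs with no common article get no edge
            weights[(x, y)] = shared / len(X | Y)
    return weights

# Hypothetical toy data: u1 and u2 share one of three distinct articles.
votes = [("u1", "a1"), ("u1", "a2"), ("u2", "a2"), ("u2", "a3"), ("u3", "a4")]
edges = jaccard_projection(votes)
```

The pairwise loop is quadratic in the number of users, which is fine for a sketch; a dataset of the size described later would likely need an article-indexed approach that only visits pairs of users who share a vote.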



Figure 1: (Left) Bipartite graph of users and articles. (Right)
Example of projected graph of users and the communities
found in a one-month time frame of data, each community
in a different color.



In the study of network topologies one of the most widely
used measures of community formation is the modularity
metric (Girvan and Newman 2002), which compares the
number of edges between vertices belonging to the same
community to the expected number of edges among the
same nodes in a null model, i.e. a random graph with the
same degree sequence. We use the expression for modularity
of a weighted graph defined in (Newman 2004) as:

$$Q = \frac{1}{2W} \sum_{ij} \left[ w_{ij} - \frac{s_i s_j}{2W} \right] \delta(c_i, c_j)$$

where $w_{ij}$ is the weight of the edge between vertices i and j, W
is the sum of the weights of all edges, $s_i$ is the strength of
vertex i, defined as the sum of the weights of edges adjacent
to the vertex, $c_i$ is the community that vertex i belongs to,
and $\delta$ is the Kronecker delta. The expression $s_i s_j / 2W$ computes
the expected number of edges between vertices i and j in the
null model.
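As a hedged illustration of the weighted modularity of (Newman 2004), the sketch below evaluates Q for a given partition. The function name `weighted_modularity`, the edge-dictionary format, and the two-triangle toy graph are assumptions for illustration; the sketch only scores a partition and does not perform the maximization itself.

```python
def weighted_modularity(weights, community):
    """Q = (1/2W) * sum_ij [w_ij - s_i*s_j/(2W)] * delta(c_i, c_j).

    weights: {(i, j): w}, each undirected edge listed once, no self-loops.
    community: {vertex: community label}.
    """
    W = sum(weights.values())
    strength = {}
    for (i, j), w in weights.items():
        strength[i] = strength.get(i, 0.0) + w
        strength[j] = strength.get(j, 0.0) + w

    # Observed term: every undirected edge appears twice in the double sum.
    q = sum(2 * w for (i, j), w in weights.items()
            if community[i] == community[j])

    # Null-model term: summing s_i*s_j over ordered pairs in one community
    # gives the squared total strength of that community.
    totals = {}
    for v, s in strength.items():
        totals[community[v]] = totals.get(community[v], 0.0) + s
    q -= sum(t * t for t in totals.values()) / (2 * W)
    return q / (2 * W)

# Hypothetical toy graph: two triangles joined by a single edge.
toy = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
       ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1, ("c", "d"): 1}
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q_split = weighted_modularity(toy, split)                   # 5/14
q_single = weighted_modularity(toy, {v: 0 for v in split})  # 0
```

Putting every vertex in one community yields Q = 0, while the natural two-triangle split yields 5/14, which is the behavior a modularity-maximizing algorithm exploits.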

To find sequences of such vote-based communities, we
first construct bipartite graphs and their single-mode
projections for the data in consecutive time frames. Then, using a
fast modularity maximization algorithm (Clauset, Newman,
and Moore 2004), we find communities for each time frame.
Figure 1 shows a visual example of communities found in a
one-month time frame of our dataset, which will be described
in detail later in the paper.

For every pair of successive time frames we compute
transition probabilities between every community pair $C_i$ and
$C_j$ at times $t_1$ and $t_2$ and construct a matrix of transition
probabilities. More specifically, each element $P_{ij}$ is
computed as:

$$P_{ij} = \frac{n(C_i(t_1) \cap C_j(t_2))}{n(C_i(t_1))}$$

In this matrix, the $C_j(t_2)$ with the largest transition probability
from $C_i(t_1)$ is the community in $t_2$ to which most of
the users in $C_i$ in the previous window move. Based on the
highest transition probabilities for every pair of communities
in consecutive times, we create a visualization of paths
of opinion-based communities.
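The linking of communities across consecutive time frames can be sketched as follows. This is an illustration under stated assumptions: each transition probability is taken to be the fraction of a community's users at the earlier time who appear in a given community at the later time, and the function names and toy memberships are hypothetical.

```python
def transition_probabilities(communities_t1, communities_t2):
    """P[i][j] = n(C_i(t1) & C_j(t2)) / n(C_i(t1)).

    Both arguments are lists of non-empty sets of user ids.
    """
    return [
        [len(ci & cj) / len(ci) for cj in communities_t2]
        for ci in communities_t1
    ]

def most_likely_successor(P):
    """For each row, the index of the later community with the largest probability."""
    return [max(range(len(row)), key=row.__getitem__) for row in P]

# Hypothetical memberships in two consecutive time frames.
t1 = [{"u1", "u2", "u3", "u4"}, {"u5", "u6"}]
t2 = [{"u1", "u2", "u3", "u7"}, {"u4", "u5", "u6"}]
P = transition_probabilities(t1, t2)   # [[0.75, 0.25], [0.0, 1.0]]
links = most_likely_successor(P)       # [0, 1]
```

Chaining the `links` obtained for each pair of consecutive frames gives one candidate evolution path per starting community.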

2.2 Representative Content 

The evolving communities detected in the previous section 
will define the skeleton of voting behavior among users. In 
order to characterize the nature of detected communities and 
add a layer of meaning, we first find the articles most pre- 
ferred by each community. 

Intuitively, articles preferred by a community will demon- 
strate a high level of (positive) deviation from the number 
of votes they are expected to receive from that community. 
Considering the network of communities and articles with 
each vote connecting a community to an article (Figure 2),
one can construct a random graph such that the degree se- 
quences (i.e. number of votes cast by users in communities 
and received by articles) are preserved. The random graph 
is created by connecting an edge coming out of a commu- 
nity to one going into an article uniformly at random. We 
compute the expected number of edges between communi- 
ties and articles in this random graph and find the deviation 
from the true number of edges observed in the data. In the 
random graph, the expected number of votes given to article 
A from users in community C will equal: 

$$E(A, C) = \frac{n(C)\, n(A)}{N}$$



where n(C) is the total number of votes cast by users in 
community C, n(A) is the total number of votes received by 
article A, and N is the total number of votes cast by all users 
to all articles. 
Figure 2: Graph of user communities (shaded ovals) and
articles.

The deviation score can be computed using the following
expression:

$$\mathrm{Score}(A, C) = \frac{(O(A, C) - E(A, C))^2}{E(A, C)}$$

where O(A, C) is the observed number of votes received by
article A that come from users in community C. We compute
this for the cases where O(A, C) > E(A, C) in order to
obtain only those observations that are more popular than
expected. This score is inspired by Pearson's $\chi^2$ test statistic
(Pearson 1900).

Using this expression, we rank articles for every commu- 
nity belonging to a time frame and create a list of most rep- 
resentative articles for each community. We expect that the 
articles representing each community will have similarities 
in their content that signify a difference from other commu- 
nities, and that this preference within each community will 
carry over through the whole evolution path. 
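A minimal sketch of this ranking step is given below. It assumes a Pearson-style score of the form (O - E)^2 / E restricted to articles with O > E; the function name `representative_articles`, the `(community, article)` input format, and the toy votes are assumptions for illustration.

```python
from collections import Counter

def representative_articles(votes, top_n=10):
    """Rank articles per community by positive deviation from expectation.

    votes: iterable of (community, article) pairs, one per vote.
    Returns {community: [(article, score), ...]} in decreasing score order.
    """
    votes = list(votes)
    N = len(votes)                           # total number of votes
    observed = Counter(votes)                # O(A, C)
    n_c = Counter(c for c, _ in votes)       # n(C): votes cast by community C
    n_a = Counter(a for _, a in votes)       # n(A): votes received by article A

    ranked = {}
    for (c, a), o in observed.items():
        e = n_c[c] * n_a[a] / N              # E(A, C) in the random graph
        if o > e:                            # keep only over-represented articles
            ranked.setdefault(c, []).append((a, (o - e) ** 2 / e))
    for c in ranked:
        ranked[c].sort(key=lambda pair: pair[1], reverse=True)
        ranked[c] = ranked[c][:top_n]
    return ranked

# Hypothetical toy data: each community favors a different article.
toy_votes = ([("C1", "a1")] * 3 + [("C1", "a2")]
             + [("C2", "a2")] * 3 + [("C2", "a1")])
ranking = representative_articles(toy_votes)
```

In the toy data each community casts 4 of the 8 votes and each article receives 4, so E = 2 for every pair; the favored article has O = 3 and score 0.5, while the other falls below expectation and is dropped.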

We will now extract a more granular characterization of
content reflected in each evolution path. For this purpose,
we consider the ranked list of representative articles within
each community. Given that each article that is posted to the
site includes a title and a summary of its content, we use a
bag of words model to find the deviation between the words
used in representative articles for a community and the rest
of the articles posted in a time frame. To compute this
deviation, we find term frequencies for all the words in each time
frame. We then find term frequencies for the top
representative articles per community and normalize term frequencies
between 0 and 1. Computing the difference between the two
mentioned values provides a deviation score for each term:

$$\mathrm{Score}_T = \frac{tf_{T,C}(t)}{\max_T tf_{T,C}(t)} - \frac{tf_T(t)}{\max_T tf_T(t)}$$

where $tf_{T,C}(t)$ is the term frequency for term T in
community C at time t, and $tf_T(t)$ is the term frequency for term T
over all articles at time t. We rank the words that belong to each

8 This is almost always the case in online forums and social 
news sites. 

9 Although this process of scoring is similar to term frequency-inverse
document frequency (tf-idf) weighting (Manning, Raghavan,
and Schütze 2008), note that we are not ranking documents and
are instead finding a normalized ranking of terms only, so we do
not use inverse document frequencies.



community based on their score. We can regard this ranked 
list of terms as automatically generated summary tags for 
each community. Similarly, aggregating top words for each 
community along its evolution path and finding the most fre- 
quent terms over each path will automatically generate sum- 
mary tags for each evolution path. 
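The term-scoring step can be sketched as below. The sketch assumes documents are already tokenized (title plus summary) and that each term frequency is normalized by the maximum frequency in its own set; the function name `term_scores` and the toy tokens are hypothetical.

```python
from collections import Counter

def term_scores(community_docs, all_docs):
    """Score each term by normalized tf in the community minus normalized tf overall.

    Each document is a list of tokens. Returns (term, score) pairs in
    decreasing score order, for terms that occur in the community set.
    """
    tf_c = Counter(tok for doc in community_docs for tok in doc)
    tf_all = Counter(tok for doc in all_docs for tok in doc)
    max_c, max_all = max(tf_c.values()), max(tf_all.values())
    scores = {t: tf_c[t] / max_c - tf_all[t] / max_all for t in tf_c}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical tokenized summaries for one time frame.
all_docs = [["nuclear", "iran"], ["nuclear", "talks"],
            ["iran", "election"], ["iran", "protest"]]
community_docs = all_docs[:2]   # the community's representative articles
tags = [term for term, _ in term_scores(community_docs, all_docs)]
```

Here "iran" is frequent everywhere and therefore scores negatively for the community, whereas "nuclear" is concentrated in the community's articles and ranks first.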

Finally, each article includes a URL link to its source of 
publication. Extracting the domains from URLs of represen- 
tative articles in each community and aggregating over its 
complete evolution path provides us with a list of content 
sources representing each path. 

At this point, we will have a summary visualization of the 
overall dynamics over time, a set of relevant words and do- 
mains (i.e. publication sources) most representative of each 
evolution path, as well as the capability to drill down to any 
specific time frame and get a list of representative words and 
publication sources for each community at that time. Finally, 
for each community at any time frame, a ranked list of specific
representative articles and the URL to the full article are
available for an in-depth examination.

2.3 Evaluation 

Community detection algorithms have been evaluated on 
various randomly generated benchmark graphs with com- 
munity structure (refer to a review paper by Fortunato for 
a summary of these benchmark graphs (Fortunato 2010)). 
Nevertheless, as is the case with our data, typically there 
is no ground truth available. So in this paper
we devise two methods to evaluate communities and their 
evolution paths. First, we build a simulation model that fol- 
lows mechanisms of a social news website with reason- 
able parameter values, and see how well the algorithm finds 
the "true" community structure based on (empirically unob- 
served) individual positions on an opinion space. In other 
words, we are producing a specific benchmark graph for 
our dataset which includes a ground truth. Next, we eval- 
uate whether the community evolution paths are meaningful 
by measuring source entropy within each path. We will now 
describe these processes in more detail. 

In the first method we begin by assigning each user a
position on a 2-dimensional Cartesian space that will
represent the underlying opinion space. Users are randomly
placed according to a normal distribution around one of four
equidistant center points in the four quadrants. The position
of users is considered the ground truth, with each user
belonging to one of the four communities specified by the four
quadrant centers. Given this structure, a k-means algorithm
that uses the (otherwise unobserved) user positions can find
the four user clusters with relative ease and thus serves as an
approximate lower bound for error in detecting communities.
We then generate a set of articles by randomly selecting
users who will each post articles and votes. Each generated
article is positioned in the opinion space according to a
Gaussian distribution near the user who posts it. Each user
will vote for an article with some probability if that article
is positioned closer than a certain threshold to him/her in the
political space; thus an article is likely to get a vote if it is
close to a reader's opinion. The result of this process is a
set of users, articles, and votes, which we then use as a
simulated graph for a social news platform. Complete details
of the simulation parameters and more detail on results are
available on the website.
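The generative process described above can be sketched as follows. All parameter names and default values (`user_sd`, `article_sd`, `radius`, `vote_prob`, the center coordinates) are assumptions chosen for illustration, as is the round-robin assignment of users to centers; the actual simulation parameters are not given in this section.

```python
import random

def simulate(n_users=200, n_articles=300, user_sd=0.3, article_sd=0.2,
             radius=0.5, vote_prob=0.8, seed=0):
    """Generate users, ground-truth communities, and votes in a 2-D opinion space."""
    rng = random.Random(seed)
    centers = [(1, 1), (-1, 1), (-1, -1), (1, -1)]   # one center per quadrant

    users, truth = [], []
    for u in range(n_users):
        k = u % 4                                    # assumed: equal-sized groups
        cx, cy = centers[k]
        users.append((rng.gauss(cx, user_sd), rng.gauss(cy, user_sd)))
        truth.append(k)

    votes = []
    for a in range(n_articles):
        px, py = users[rng.randrange(n_users)]       # randomly chosen poster
        ax, ay = rng.gauss(px, article_sd), rng.gauss(py, article_sd)
        for u, (ux, uy) in enumerate(users):
            close = (ux - ax) ** 2 + (uy - ay) ** 2 < radius ** 2
            if close and rng.random() < vote_prob:   # vote only if nearby
                votes.append((u, a))
    return users, truth, votes

users, truth, votes = simulate(n_users=40, n_articles=20, seed=1)
```

The returned `votes` list has the same (user, article) shape as real voting data, so the same community detection pipeline can be run on it and compared against `truth`.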

We simulate this data with different variances for the 
aforementioned Gaussian distributions. We then run our 
network-based community detection algorithm on this graph 
and compute relative error as we change the variance of the
underlying data generation process (simulation model). Figure 3
compares the results of community detection (based on
votes) with k-means clustering (based on true positions of 
users) as the standard deviation of the Gaussian distribution 
used to generate user positions changes. The algorithm is 
generally robust and successful in finding true underlying 
clusters while error increases with the standard deviation of 
user positions (i.e. as users are more scattered). When the 
value of standard deviation reaches the mid-point between 
the two centers, neither k-means nor the network based al- 
gorithm can detect clusters correctly. This simply means that 
users are distributed such that clear clusters do not exist
anymore.

Figure 3: Relative error vs. standard deviation of user
positions in the opinion space. The jump in the k-means error
is due to the fact that true user memberships are no longer
recognizable. Relative error is computed based on pairs of
users that are classified incorrectly together or separate. 500
users were generated. Results are based on an average of 10
simulations. Error bars mark two standard deviations.

10 While for clarity this simulation assumes a 2-dimensional
opinion space, we make no such assumptions in the general
methodology.

In addition to the above simulation-based evaluation,
we propose an indirect way of evaluating the full evolution
path for each community. This method is based on finding
whether, throughout the length of a community's evolution
path, there is a preference for a few sources of publication.
Since the votes are cast to completely different articles, there
should be no expectation that their sources be the same
unless users in each evolution path are favoring certain sources
of information over others, an indication of common
underlying preferences.
We aggregate the top n representative articles over all the
time frames in a community evolution path. We then
calculate the Shannon entropy (Shannon et al. 1949) of the source
of these articles (as indicated by their domains). This will
signify the amount of source variation over top preferred
articles for each evolution path:

$$\mathrm{Entropy}(C) = -\sum_i p_i \log_2(p_i)$$

where $p_i$ is the probability that an article from source i is
in the top n most preferred articles of community C.

A lower entropy value indicates lower variation and 
higher uniformity in sources of articles. Entropies found 
for evolving communities are then compared to entropies 
from sets of articles drawn at random. We generate the ran- 
dom sets by randomly choosing votes, finding which articles 
the votes were cast for, and then extracting the domain of 
the article. We randomly choose votes rather than randomly 
choosing articles because we want the articles with higher 
votes to have a higher probability of being chosen. This is 
important because the list of most preferred articles in each 
community is also based on the preference of a community's 
users to vote for that article. 

We then compute the effective number of sources in an
evolution path as $2^{\mathrm{Entropy}}$, compare it with that of the
randomly selected sets, and compute the ratio as:

$$\text{Relative Recurrence} = \frac{2^{\mathrm{Entropy}(random)}}{2^{\mathrm{Entropy}(path)}}$$
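The entropy-based comparison can be illustrated with a short sketch. The function names and the placeholder domains are assumptions for illustration; the ratio is the effective number of sources in the random set divided by that in the path, so values above 1 indicate more recurrence within the path.

```python
import math
from collections import Counter

def shannon_entropy(domains):
    """Shannon entropy (in bits) of the empirical distribution of domains."""
    counts = Counter(domains)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def relative_recurrence(path_domains, random_domains):
    """Ratio of effective numbers of sources: 2^H(random) / 2^H(path)."""
    return 2 ** shannon_entropy(random_domains) / 2 ** shannon_entropy(path_domains)

# Placeholder domains: the path uses 2 sources equally, the random set uses 4.
path = ["x.example", "x.example", "y.example", "y.example"]
rand = ["a.example", "b.example", "c.example", "d.example"]
ratio = relative_recurrence(path, rand)   # 4 effective sources / 2 = 2.0
```

With equally likely sources the effective number equals the number of distinct sources, which makes the toy ratio easy to verify by hand.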

A higher recurrence in sources of information compared 
with the randomly drawn dataset will strongly suggest that 
the evolution paths are highly preferential toward certain 
sources, corroborating that they are meaningful. In the next
sections we will demonstrate this methodology on a real
dataset, where the above explanations will become clearer
through example.

3 Experimental Setup 

3.1 Data Description 

We apply our methodology to a social news website with 
article link submission, voting, and commenting systems. 
The website, named Balatarin (translated The Highest), is 
a mainly Persian-language social news site that played an 
important role during the Iranian post-election protests in 
2009, dubbed the Green Movement. The website became a 
hub for disseminating information, as well as a space for 
people to exchange opinions, propose ideas and even orga- 
nize to take action to protest in the real world. The massive 
uprising marked a turning point in Iranian politics and while 
a great deal of media attention was paid to the role of Twitter
in the protests, less consideration was given to Balatarin,



12 This measure is used in ecology as the effective number of
species (Hill 1973) in an ecosystem. Another metric for diversity
is found by comparing the effective number of sources with the 
number of unique sources in each set. Using this metric we reached 
similar results. 

13 As an example, see the Washington Post article titled "Twitter Is
a Player In Iran's Drama", published June 17, 2009.



mainly due to the language barrier. Nevertheless, inside Iran 
and within the Persian-speaking population Balatarin was
among the most prominent social web entities at the time. 
Balatarin is similar to Reddit in that it does not have an ex- 
plicitly defined friendship network, yet similar to Digg in 
that it focuses on positive votes to rank articles. 




Figure 4: Timeline of articles posted to the site (top graph is
all articles, bottom is only articles in the politics category).
The data spans over 1500 days.

The dataset includes a total of over 1.2 million articles,
26,000 users and 31 million votes posted from August 2006
to November 2010. Less than 3% of users are responsible
for more than 55% of the votes. The articles are tagged
according to their category and we will focus our attention on
the articles in the Politics category (a total of 352,000
articles), since finding trends within unrelated content
categories will not be a meaningful or desirable task. Figure 4
shows a timeline of the number of articles posted to Balatarin as
well as the number of articles in Politics. The sudden rise in
the number of articles coincides with the 2009 protests.

To investigate the data over time, we choose a 30-day 
time frame and slide this frame over the duration of the data 
to form temporally consecutive datasets (users, articles, and 
votes in each time frame). Sliding the frame two weeks at a 
time produces 110 time-frames over the whole duration of 
the data. 
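The windowing scheme can be sketched with the standard library. The function name `sliding_windows` and the example dates are hypothetical (only the 30-day length and two-week step come from the text), and the sketch makes no claim to reproduce the exact number of frames reported.

```python
from datetime import date, timedelta

def sliding_windows(start, end, length_days=30, step_days=14):
    """Yield (window_start, window_end) pairs, end exclusive, inside [start, end)."""
    length = timedelta(days=length_days)
    step = timedelta(days=step_days)
    t = start
    while t + length <= end:
        yield t, t + length
        t += step

# Hypothetical three-month span, used only to show the overlap pattern.
windows = list(sliding_windows(date(2006, 8, 1), date(2006, 11, 1)))
```

Because the step is shorter than the window, consecutive frames share roughly half of their days, which helps keep community membership comparable from one frame to the next.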

3.2 Results 

Figure 5 shows evolving communities over 110 overlapping
time frames, starting at the launch of the website in 2006
at the top of the figure and progressing downward. Each oval
shape represents a community and communities placed on
the same row belong to the same time window. Size of
the ovals reflects the number of users in the community 
(community sizes range from 10 users to over 3000 users) 
and communities in consecutive times are connected as de- 
scribed in the previous section. 

Distinct evolution paths of different durations can be ob- 
served and events such as birth, death, merge, split, growth 



14 The short sharp drop to zero marks a shut-down due to an
attack on the site in February 2009.

15 The graph was generated using the PyGraphviz library in Python.




Annotation boxes in Figure 5:

Trajectory A. User Retention: High. Source Recurrence: High.
Representative Sources: Websites outside Iran (e.g. BBC Persian).
Representative Terms: International relations, nuclear, sanctions, defense.

Trajectory B. User Retention: Medium. Source Recurrence: Medium.
Representative Sources: Iranian conservative websites.
Representative Terms: Internal events.

Trajectory C. User Retention: High. Source Recurrence: Medium.
Representative Sources: Green Movement and old-time opposition.
Representative Terms: Mixed opposition issues.

Trajectory D. User Retention: Medium. Source Recurrence: High.
Representative Sources: Green Movement sites (e.g. kaleme.com).
Representative Terms: Opposition leaders, Slogans, Protest locations.

Event annotation: Massive protests dubbed the Green Movement.
• Large increase in activity
• Major reorganization of paths
• Reduced posting from conservative sources

Timeline label: June 2007.

Figure 5: Paths illustrating evolving communities in Balatarin.com. Time begins on top of the figure and progresses downward,
oval shapes represent communities and their sizes correspond to community size (the horizontal position of the communities is
merely chosen for ease of visualization). Boxes summarize characteristics of four example paths labeled as A, B, C, and D and
delineated by large dots on the graph. A significant event (Iranian post-election uprising in June 2009) marks a transition in the
dynamics of the site. Note that there are several other paths that are observable in this graph, and while we have only chosen
four as demonstration, other paths are of similar quality to the selected paths.



and contraction of communities are evident along the paths. 
Furthermore, the effect of the Iranian post-election protests
in June 2009 is readily evident as a sudden increase in
community sizes at the onset of the event. This is in
agreement with the increase in number of articles (Figure 4),
which almost doubles during this time. In addition, there
is a shuffling of paths and there are sizable merges and
reformation of paths after the event. Thus, similar to its effects
in the real world, this event has had a significant impact on
the dynamics of the user population on the site. We choose
four paths (labeled A, B, C, and D) to investigate further in
the next sections. These were chosen such that we have a
number of paths occurring at different times and not due to
any superiority of quality; other paths are of similar quality
to the selected paths. These paths are marked on the figure,
two of them corresponding to a time prior to the June 2009
event, and two of them belonging to a time after the event.
Following the steps in Section 2.2 we produce representative
terms and domains for each path. Table 1 lists these results.
At this stage, we can step into a finer granularity by
focusing on specific points that may be of interest, such as a
merge between two communities. Specific terms, domains,
article summaries, and URLs representative of each community
are readily available for further investigation through
simple queries.

Table 1: Summary of domains and terms associated with
four example evolution paths. Terms have been translated
from Persian to English.

| Path | Domains | Terms |
|------|---------|-------|
| A | www.bbc.co.uk, www.dw-world.de, www.roozonline.com, www.isna.ir, www.noandish.com | Nuclear, America, Iran, People, Republic, Russia, Iraq, Israel |
| B | www.alef.ir, www.youtube.com, www.noandish.com, www.tabnak.ir, farhadheyrani.blogspot.com | Photo, Leader, Torture, Mortazavi, Prison, Father, Child, Public, Ahmadinejad |
| C | www.rahesabz.net, zamaaneh.com, www.radiofarda.com, www.dw-world.de, news.gooya.com | Prison, Participation, Rights, Karroubi, Government, Country, Political, Arrest |
| D | www.youtube.com, www.kaleme.com, www.rahesabz.net, iarandoost657.blogspot.com, gomnamian.blogspot.com | MirHossein, Security, Allah-o-Akbar, Mousavi, Islamic, Slogan, Square, Mehdi [Karroubi] |

3.3 Evaluation

Manual inspection shows evidence of similarity of preference
between users in each community both in text and in
sources of content. In some extreme cases small communities
demonstrate strong preference for certain websites to
the point where the links most associated with that community
all belonged to the same domain. An example is a
community of 18 users in January 2009 whose top 10 preferred
articles all came from the pro-government website
(www.fararu.com) and which had very high graph density (all users
in the community voted exactly the same way). This occasional
extreme uniformity in source of articles (as evidenced by
the domain) hints at a possibly organized off-site effort by a
group of users somehow affiliated with the website, paid by
the same entity, or dedicated to advocating a cause.

User Retention. Since our goal is to group users solely
based on their vote similarity, we do not define and utilize
core users to find evolving communities (a few papers have
proposed finding communities based on core users (Seifi and
Guillaume 2012; Wang, Wu, and Pei 2008)). Therefore,
because evolution paths are inferred by computing transition
probabilities between consecutive time frames, it is not clear
whether the paths will remain meaningful and consistent
after several time steps. If at each step a number of users leave
and new users join the community, will any of the same users
remain after several time frames? Will there still be content
coherence within the whole path? Will it be reasonable to
assume this is the same evolving community after so many
time steps? To answer these questions we compute user
retention by studying membership within a path across several
time steps. For a path P, we compute user retention after $\Delta\tau$
periods from time $\tau_i$ as:

$$\mathrm{Retention}(P, \Delta\tau) = \frac{n(P(\tau_i) \cap P(\tau_i + \Delta\tau))}{n(P(\tau_i))}$$

where $P(\tau_i)$ is the set of users in path P at time $\tau_i$.

Figure 6: User retention (average fraction of users remaining
in path) vs. $\Delta\tau$ for community evolution paths A, B, C, D.


Figure 6 illustrates the average retention for different 
$\Delta\tau$ values (retention is averaged over all $\tau_i$ 
spanning the whole path) for four different evolution paths. 
Note that two of the chosen paths are longer (spanning more 
than one year) and two are shorter. The results show that 
evolution paths have reasonably high user retention. Paths A 
and B have more than 50% user retention within 3 months, and 
between 20% and 40% user retention after 1 year (24 
evolutionary cycles in our algorithm). In the worst case, 20% 
of the users are still voting similarly after one year, which 
is significant considering that these communities are not based 
on any explicit connections and that within a year there is 
natural turnover in a site's user population. 
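
The retention measure described above can be illustrated with a minimal sketch. It assumes a path is stored as a list of user-ID sets, one per time frame; the function names `retention` and `average_retention` and the toy user IDs are hypothetical and are not taken from the paper's implementation.

```python
from statistics import mean


def retention(path, i, delta):
    """Fraction of the users in frame i who are still in the path at frame i + delta.

    `path` is assumed to be a list of sets of user IDs, one set per time frame.
    """
    start = path[i]
    later = path[i + delta]
    return len(start & later) / len(start)


def average_retention(path, delta):
    """Average retention over every starting frame i for which i + delta exists."""
    if delta >= len(path):
        raise ValueError("delta must be smaller than the path length")
    values = [
        retention(path, i, delta)
        for i in range(len(path) - delta)
        if path[i]  # skip empty frames to avoid division by zero
    ]
    return mean(values)


# Toy path with three time frames (hypothetical user IDs).
path = [
    {"u1", "u2", "u3", "u4"},
    {"u1", "u2", "u3", "u5"},
    {"u1", "u2", "u6", "u7"},
]

print(average_retention(path, 1))  # (3/4 + 2/4) / 2 = 0.625
print(average_retention(path, 2))  # 2/4 = 0.5
```

Plotting `average_retention(path, delta)` against `delta` for each path would give a curve of the kind summarized in Figure 6.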

Table 2: Relative recurrence of domains in each path, 
demonstrating that paths are highly preferential toward 
certain content sources. 

Path                  A      B      C      D 
Relative recurrence   3.60   1.74   2.31   2.41 



Source Recurrence Table 2 lists the relative recurrence of 
sources within each of the four selected paths A, B, C, and D, 
as explained in Section 2.3. We observe that all four evolution 
paths show increased recurrence of information sources. We see 
as much as 3.6 times more recurrence of sources compared to a 
set drawn randomly (proportionally to an article's votes), 
demonstrating strong preferences toward some sources of 
information. 
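
The comparison against a vote-weighted random baseline can be sketched as follows. The paper's exact recurrence measure is defined in an earlier section and is not reproduced here, so this sketch assumes a stand-in measure (average number of articles per distinct domain). The names `recurrence` and `relative_recurrence`, the `.example` domains, and the vote counts are all hypothetical.

```python
import random


def recurrence(domains):
    """Assumed stand-in measure: average number of articles per distinct domain."""
    return len(domains) / len(set(domains))


def relative_recurrence(path_domains, pool, samples=500, seed=0):
    """Ratio of a path's recurrence to that of vote-weighted random article sets.

    `pool` is a list of (domain, votes) pairs, one per candidate article.
    Baseline sets have the same size as `path_domains` and are drawn with
    replacement, with probability proportional to each article's votes.
    """
    rng = random.Random(seed)
    pool_domains = [domain for domain, _ in pool]
    weights = [votes for _, votes in pool]
    k = len(path_domains)
    baseline = sum(
        recurrence(rng.choices(pool_domains, weights=weights, k=k))
        for _ in range(samples)
    ) / samples
    return recurrence(path_domains) / baseline


# Toy data: a path whose representative articles come mostly from one domain.
path_domains = ["a.example"] * 6 + ["b.example"] * 2
pool = [
    ("a.example", 10), ("b.example", 12), ("c.example", 8), ("d.example", 9),
    ("e.example", 11), ("f.example", 10), ("g.example", 7), ("h.example", 13),
]

ratio = relative_recurrence(path_domains, pool)
print(ratio > 1.0)  # the path reuses sources more than the random baseline
```

A ratio above 1 means the path's articles reuse the same sources more often than vote-weighted chance would suggest, which is the sense in which Table 2 is read.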

4 Discussion of Results 

We now combine all the information that can be gleaned from 
the paths and their summaries as produced through our automated 
and unsupervised process. We focus on the selected paths A, B, 
C, and D as labeled in Figure 5 and summarized in Table 1. Note 
that there are several other paths; we have chosen to focus on 
these four only as examples. We can see that the massive 
protests, dubbed the Green Movement, that took place after the 
Iranian presidential elections on June 12, 2009 create a 
significant disruption in the organization of paths on 
Balatarin. Paths A and B precede this disruptive event, whereas 
C and D occur after it. Also, paths A and B overlap in time, 
and path D overlaps with the end of path C; we can therefore 
compare and contrast the paths within each pair. 

• Path A: Table 1 shows that this path is formed mainly 
around the issues of Iran's international relations, including 
the nuclear issue, as well as relations with the US, Russia 
and Europe, the Israeli-Palestinian conflict, and the Iraq war 
(terms related to sanctions and defense are also among the top 
terms but are not listed in the table due to space 
limitations). This path favors articles from prominent news 
agencies outside Iran (which the Iranian authorities generally 
do not approve of) such as BBC Persian (www.bbc.co.uk) and the 
Germany-based Deutsche Welle (www.dw-world.de), but also 
articles from some news agencies within Iran such as the 
Iranian Students' News Agency, which at the time published 
content close to reformist groups. This path has high user 
retention and high recurrence of its sources of content. 

• Path B: This path is focused more on Iranian internal issues 
and, while it does not demonstrate strong loyalty to specific 
domains, it favors articles from conservative websites inside 
Iran (www.alef.ir and www.tabnak.ir both belong to conservative 
Iranian statesmen). This path shows more variability in sources 
of content, and its top domains include websites belonging to 
reformists (rivals to the conservatives) as well. Path B shows 
the lowest user retention among all four example paths. We 
observe that the paths before the election were more 
issue-based, focusing on international versus internal issues. 

• The Green Movement: This major external event occurred in 
June 2009 when the Iranian government violently crushed large 
protests. We observe a significant reorganization in paths and 
their contents. The presence of government news sources and 
conservative Iranian websites (such as those in path B) has 
almost vanished in major paths. While these websites do appear 
in smaller intermittent communities, those communities fail to 
stay continuously active and form a path. We found that a 
number of users from Path B were absorbed into other paths 
(possibly because of a change in their political position). 

• Path C: This is a long-lasting path that favors news 
and analysis from news agencies outside Iran as well 
as sites affiliated with the Green Movement (e.g. 
www.rahesabz.net). Although very much focused on the 
aftermath of the Green Movement, this path demonstrates 
more diversity in its sources of content. While this path is 
of similar length to path A, its user retention drops faster 
than that of path A after one year. 

• Path D: This path demonstrates clear political leanings 
through its very high recurrence of content sources that are 
well-known websites affiliated with the Green Movement 
(www.kaleme.com and www.rahesabz.net), and its text tends 
toward names of Green Movement leaders, slogans, and protest 
locations, as evident in Table 1. Although the two paths are of 
similar political orientation, path C is more focused on news 
and analysis from established news agencies, whereas path D has 
a preference for blogs and YouTube videos and, in terms of 
content, focuses on eyewitness accounts and protest 
organization. Some of these weblogs seem to have been created 
solely for reporting specific protests and may have few posts; 
others have been shut down. 

In addition to the above summaries, because domain and term 
rankings are created for each community at each time frame, 
the results provide a multi-scale capability: we can focus 
further on summaries of a path at a particular time frame and 
compare dominant themes across different times as needed. 

5 Related Work 

Clustering and community detection methods are in essence 
network summarization tools (Girvan and Newman 2002) (Clauset, 
Newman, and Moore 2004) (Blondel, Guillaume, and Lefebvre 2008) 
(Guimera, Sales-Pardo, and Amaral 2007) (Barber 2007). A survey 
paper by (Fortunato 2010) provides a comprehensive summary of 
this field. Building on this literature, a growing body of work 
has been produced on community evolution, varying from works on 
evolutionary clustering (Palla, Barabasi, and Vicsek 2007) 
(Chakrabarti, Kumar, and Tomkins 2006) (Wu et al. 2009) and 
community detection in dynamic social networks 
(Tantipathananandh, Berger-Wolf, and Kempe 2007) to processes 
that also optimize for smoothness in temporal evolution (Lin et 
al. 2008) or use community cores in evolving networks (Seifi 
and Guillaume 2012). A categorization and review of community 
evolution methods is presented by (Giatsoglou and Vakali 2012). 
These works focus solely on the network structure. 

A number of recent papers focus on using content alone to 
create summaries of text, such as opinions (Ganesan, Zhai, and 
Viegas 2012), product reviews (Liu, Hu, and Cheng 2005), 
political leanings (Fang et al. 2012) (Kaschesky, Sobkowicz, 
and Bouchard 2011) (Jiang and Argamon 2008), and news streams. 
More specifically, (Shahaf, Guestrin, and Horvitz 2012) create 
structured summaries of content in the form of narrative maps 
and (Ahmed et al. 2011) produce story-lines of streaming news. 
The bulk of literature in this field uses text-based techniques 
such as topic modeling and the language models used in 
sentiment and subjectivity analyses, and is not concerned with 
user networks. On the other hand, incorporating both the 
network graph and content, (Jo, Hopcroft, and Lagoze 2011) use 
the citation network between documents to get a better 
summarization of document content over time (topic evolution), 
(Lin et al. 2010) track popular events in the social web, and 
(Lin, Sundaram, and Kelliher 2009) summarize activity over 
time. Yet none of the mentioned papers produce a comprehensive 
multi-scale map of group behavior among users. 

There is considerable debate whether new online spaces promote 
diversity or, through winner-take-all dynamics, exacerbate 
polarization and conflict in society. While some have hailed 
the promise of democratic effects of the Internet, others have 
argued against this notion, asserting that such web-based 
platforms increase interaction among like-minded people and 
reduce contact among people of different opinions, leading to 
fragmentation in society (see for example, (Levine, Hayduk, and 
Mattson 2002) (Westen 1998) and (Hindman 2009)). (Adamic and 
Glance 2005) demonstrate political polarization in linking 
patterns between blogs labeled as liberal or conservative. (Van 
Alstyne and Brynjolfsson 2005) (Rahmandad and Mahdian 2011) and 
(Marvel et al. 2011) propose and simulate models of 
polarization dynamics in populations. Zhou et al. (Zhou, 
Resnick, and Mei 2011) jointly classify Digg users and news 
articles into one of two classes (liberal or conservative) 
using label propagation starting from a small number of labeled 
users and articles. Finally, a seminal work in political 
science (Poole and Rosenthal 1985) models polarization in 
American politics through analyzing roll-call votes by members 
of Congress (also see (Koford 1989) on dimensionality of these 
votes). 



6 Concluding Remarks 

Motivated by the challenge of understanding group behav- 
ior of user populations in large disorderly data, we devised 
a novel summarization methodology that produces a multi- 
scale map of community evolution. The proposed method is 
fully automated and unsupervised and can be widely applied 
to other contexts. We used indicators of user preference for 
content (such as "likes" or "votes") and demonstrated that 
they are a meaningful measure for finding communities in 
multi-issue contexts. The methodology generates profiles of 
evolving communities based on their representative content, 
and evaluates them by measuring recurrence of sources of 
information preferred by their users. 

Evolution paths found for the real-world dataset in this paper 
showed high user retention and, in varying degrees, favored 
different text and sources of content. We observed that 
recurrence of sources in articles representing an evolving 
community is at times as much as 3.6 times that of a randomly 
drawn set of articles, corroborating the reliability of the 
detected paths. Last but not least, the methodology provides a 
means to observe data at different granularities by producing 
summaries throughout the evolution path as well as within each 
community in one time frame, allowing expert investigators to 
formulate further inquiries. 